##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Below is a summary of how the values in each column are distributed across the wines.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Three things are clear from this table. First, quality ratings only range from 3 to 8. Second, density varies very little, so it may have little effect on quality ratings. Third, we only have 1,599 wines, which is not a large number for this analysis.
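The three observations above can be checked directly in R. The following is a minimal sketch on a tiny made-up data frame (`pf_toy` and its values are assumptions for illustration, not the real wine data); the coefficient of variation is one possible way to express "does not vary much".

```r
# Toy stand-in for the wine data; the values are made up for illustration.
pf_toy <- data.frame(
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9956, 0.9962),
  quality = c(5L, 5L, 6L, 3L, 8L, 6L)
)

summary(pf_toy)                          # min, quartiles, mean, max per column

quality_range <- range(pf_toy$quality)   # smallest and largest rating
n_wines <- nrow(pf_toy)                  # number of cases

# Coefficient of variation: spread relative to the mean.
# A very small value means the variable barely changes between wines.
density_cv <- sd(pf_toy$density) / mean(pf_toy$density)
density_cv
```

On the real data, the same calls would be run on the full data frame instead of `pf_toy`.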

Most wines are rated 5 or 6. Far fewer wines are rated below 5 or above 6. Quality ratings are distributed roughly symmetrically between 3 and 8.
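A frequency table is a quick way to confirm how strongly the ratings concentrate on 5 and 6. This sketch uses a short made-up vector of ratings (`quality_toy` is hypothetical), so the counts are illustrative only.

```r
# Made-up ratings for illustration; not the real distribution.
quality_toy <- c(3, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8)

# Number of wines per rating
counts <- table(quality_toy)
counts

# Share of wines rated 5 or 6
share_5_6 <- mean(quality_toy %in% c(5, 6))
share_5_6
```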

The fixed acidity distribution is right skewed with a peak at around 7. Values range from 4.6 to 15.9.

The volatile acidity distribution is right skewed with a peak at around 0.6. Values range from 0.12 to 1.58.

Citric acid values range from 0 to 1. The distribution is roughly right skewed, with the most values at 0 and around 0.5.

Residual sugar ranges from 0.9 all the way to 15.5. However, most values fall between 1.9 and 2.6. This distribution is right skewed.

The chlorides distribution is right skewed. Most values are around 0.079, and values range from 0.012 to 0.611.

The free sulfur dioxide distribution is right skewed. Most values are around 5, and values range from 1 to 72.

Total sulfur dioxide is also right skewed. Its values vary a lot, from 6 to 289, but most are at the low end, between 6 and 70.

The density distribution is symmetrical. However, wine density does not vary much.

pH values stay in the acidic range, from 2.74 to 4.01, and are distributed symmetrically.

Sulphate values are right skewed and range from 0.33 to 2. Most values are between 0.55 and 0.73.

Alcohol values also appear to be right skewed. They range from 8.40 to 14.90, and most are between 9 and 10.
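The "right skewed" and "symmetrical" labels above were read off histograms. As a rough numeric cross-check, one can compute a moment-based sample skewness, where a positive value indicates a long right tail and a value near zero indicates symmetry. The helper `skewness()` below is a hypothetical function written for this sketch (it is not from the original analysis), and both input vectors are made up.

```r
# Moment-based sample skewness: positive = right skewed, near 0 = symmetric.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up values: one long right tail, one symmetric sample.
right_tail <- c(1.9, 2.0, 2.1, 2.2, 2.3, 2.6, 15.5)
symmetric  <- c(3.0, 3.2, 3.3, 3.4, 3.6)

skew_right <- skewness(right_tail)  # clearly positive
skew_sym   <- skewness(symmetric)   # approximately zero

# A simpler hint available from summary(): mean above median suggests right skew.
mean(right_tail) > median(right_tail)
```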

Univariate Analysis

What is the structure of your dataset?

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the Portuguese “Vinho Verde” red wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). Using these ratings, we will try to find which properties make a wine receive the highest ratings. The full list of columns is shown above.

What is/are the main feature(s) of interest in your dataset?

Chemical features of wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

None of the attributes stands out. However, alcohol, fixed acidity, volatile acidity, citric acid, and free and total sulfur dioxide interest me because they vary over a wider range.

Did you create any new variables from existing variables in the dataset?

I have not created any yet.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

We checked for missing values, but none were found.
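The report does not show how the missing-value check was done. One common way, sketched here on a made-up data frame that deliberately contains one `NA` (`pf_toy` is hypothetical), is to count `NA`s per column.

```r
# Toy data with one deliberately missing value, for illustration only.
pf_toy <- data.frame(
  alcohol = c(9.4, NA, 10.2),
  pH      = c(3.51, 3.20, 3.26)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(pf_toy))
na_per_column

# TRUE if any value anywhere in the data frame is missing
any_missing <- any(is.na(pf_toy))
any_missing
```

On a data set with no missing values, every entry of `na_per_column` would be 0 and `any_missing` would be `FALSE`.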

Bivariate Plots Section

Because nothing stands out, we will plot every variable against the quality attribute.

As we can see, volatile acidity tends to be lower as the quality rating gets higher.

Mean and median values are almost the same across all quality ratings.

Citric acid tends to be higher at higher quality ratings.

Residual sugar stays almost the same across the quality ratings.

Chlorides also stay the same.

Density decreases slightly, but its values vary only by very small amounts.

Total sulfur dioxide remains almost the same, except for a slight peak at a rating of 5.

pH tends to be slightly more acidic at higher quality ratings.

Sulphates increase slightly in wines with higher quality ratings.

Mean and median alcohol values are almost the same for ratings 3 to 5. In wines with higher quality ratings, however, alcohol values tend to be higher.
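The comparisons above rely on the mean and median of each variable within each quality rating. A minimal sketch of that calculation with base R's `tapply()`, using made-up numbers (`pf_toy` is hypothetical), might look like this:

```r
# Toy data for illustration; values are made up.
pf_toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 9.5, 10.5, 10.9, 11.8, 12.4)
)

# Mean and median alcohol within each quality rating
mean_by_q   <- tapply(pf_toy$alcohol, pf_toy$quality, mean)
median_by_q <- tapply(pf_toy$alcohol, pf_toy$quality, median)

# Side-by-side table, one row per rating
by_quality <- cbind(mean = mean_by_q, median = median_by_q)
by_quality
```

Repeating this for each chemical variable gives the numbers behind the per-rating plots.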

Bivariate Analysis

Most median and mean values of the attributes vary little or stay the same across all ratings. However, alcohol, volatile.acidity, citric.acid, and sulphates stand out from the other attributes by changing the most across quality ratings.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What was the strongest relationship you found?

Alcohol has the strongest relationship.
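One way to back up a "strongest relationship" claim is to compute the Pearson correlation of each variable with quality and sort by absolute value. The sketch below uses a made-up data frame (`pf_toy` is hypothetical), so the resulting numbers say nothing about the real wines.

```r
# Toy data for illustration; values are made up.
pf_toy <- data.frame(
  quality        = c(5, 5, 5, 6, 6, 7, 7),
  alcohol        = c(9.4, 9.8, 9.5, 10.5, 10.9, 11.8, 12.4),
  residual.sugar = c(2.2, 1.9, 2.6, 2.0, 2.5, 2.1, 2.3)
)

# Correlation of every non-quality column with quality
predictors <- setdiff(names(pf_toy), "quality")
cors <- sapply(pf_toy[predictors], function(x) cor(x, pf_toy$quality))

# Strongest relationships first, regardless of sign
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
cors_sorted
```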

Multivariate Plots Section

We will group the quality ratings into three groups:

  - Low rating group (ratings in range 3 to 5)
  - Middle rating group (ratings in range 5 to 6)
  - High rating group (ratings in range 6 to 8)

The middle rating group is “narrow” because most wines are rated 5 or 6.
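Grouping of this kind is typically done with `cut()`. The text does not say which group the boundary ratings 5 and 6 fall into, so the break points below are an assumption made for this sketch: ratings 3 to 4 are treated as low, 5 to 6 as middle, and 7 to 8 as high. The vector `quality_toy` is made up.

```r
# Made-up ratings for illustration.
quality_toy <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Assumed boundaries (not confirmed by the report):
# (2, 4] -> low, (4, 6] -> middle, (6, 8] -> high
rating_group <- cut(
  quality_toy,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "middle", "high")
)

# Number of wines per group
table(rating_group)
```

Changing `breaks` is all that is needed to try a different split.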

As we can see, the green dots are shifted to the right compared to the yellow ones. This suggests that wines with a higher quality rating tend to have a higher alcohol content. Higher quality wines also tend to have slightly higher sulphate levels.

This graph indicates that the lower rating group tends to have higher volatile acidity and lower alcohol content compared to higher rated wines.

Here we can see that “lower quality” wines have both low and high values of citric acid. However, “higher quality” wines tend to have slightly more values at the higher end of citric acid.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = pf)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = pf)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + citric.acid, 
##     data = pf)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + citric.acid + 
##     sulphates, data = pf)
## 
## ============================================================================
##                          m1            m2            m3            m4       
## ----------------------------------------------------------------------------
##   (Intercept)           1.875***      3.095***      3.055***      2.646***  
##                        (0.175)       (0.184)       (0.194)       (0.201)    
##   I(alcohol)            0.361***      0.314***      0.314***      0.309***  
##                        (0.017)       (0.016)       (0.016)       (0.016)    
##   volatile.acidity                   -1.384***     -1.343***     -1.265***  
##                                      (0.095)       (0.114)       (0.113)    
##   citric.acid                                       0.068        -0.079     
##                                                    (0.103)       (0.104)    
##   sulphates                                                       0.696***  
##                                                                  (0.103)    
## ----------------------------------------------------------------------------
##   R-squared             0.227         0.317         0.317         0.336     
##   adj. R-squared        0.226         0.316         0.316         0.334     
##   sigma                 0.710         0.668         0.668         0.659     
##   F                   468.267       370.379       246.976       201.777     
##   p                     0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1621.814     -1621.596     -1599.093     
##   Deviance            805.870       711.796       711.603       691.852     
##   AIC                3448.114      3251.628      3253.192      3210.186     
##   BIC                3464.245      3273.136      3280.078      3242.448     
##   N                  1599          1599          1599          1599         
## ============================================================================
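The table above compares nested linear models that add one predictor at a time. The sketch below shows the same idea with base R only, on simulated data (the data frame `toy`, the coefficients used to generate it, and the seed are all assumptions for illustration, so the numbers will not match the table). The layout of the table above resembles output from `mtable()` in the memisc package, but that is a guess and the sketch does not depend on it.

```r
# Simulated stand-in data; not the real wine measurements.
set.seed(1)
n <- 200
toy <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)
toy$quality <- 3 + 0.3 * toy$alcohol - 1.3 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

# Nested models: start with alcohol, then add volatile acidity
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)

# Compare fit: R-squared never decreases when a predictor is added,
# so AIC and an F-test help judge whether the extra term is worthwhile.
r2  <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
aic <- c(m1 = AIC(m1), m2 = AIC(m2))
r2
aic
anova(m1, m2)
```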

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The features strengthen each other, but not by much. Alcohol and citric acid have a positive relationship with quality, whereas volatile.acidity has a negative one.

Were there any interesting or surprising interactions between features?

The interesting finding is how small an effect each of these attributes has. Another surprising thing is that alcohol appears to have the strongest correlation compared to the other attributes.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

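Yes, we fitted linear regression models with the variables that appeared to have the most effect. However, the results show that these variables explain only a small part of the variation in wine ratings: R-squared reaches just 0.336 in the largest model (m4), and citric.acid is not a significant predictor.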

Final Plots and Summary

Plot One

Description One

As we can see, the ratings follow an almost perfect normal distribution. Even though ratings can range from 0 to 10, in the real world they vary from 3 to 8. There are also no fractional ratings, because wines are rated in integer values.

Plot Two

Description Two

Alcohol appears to have the biggest effect on the overall rating. However, the mean and median alcohol values remain almost stable for quality ratings 3 to 5. From rating 5 upward, the alcohol mean and median tend to rise. At higher quality ratings, the lowest alcohol values tend to increase, whereas the highest alcohol values remain the same across all ratings of 5 or higher.

Plot Three

Description Three

In the last chart we compare the quality index with the quality ratings. In a perfect world, with every higher quality range (ratings 3-5, 5-6, 6-8), all wines would sit higher in the graph because of a higher quality index. However, this is not the case, and perhaps that is why the correlation between wine attributes and quality ratings is low. As we can see, the lowest quality index values are the same across the low, middle, and high rating groups. The highest quality index values differ slightly: the low rating group tends to have lower values, whereas middle and lower-high rated wines tend to have the highest quality index values. Interestingly, the highest rated wines tend to have lower values, almost the same as those of quality rating 4.

Reflection

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). During this analysis I looked at histograms of every variable to get a sense of how the values are distributed and how much they fluctuate. Then I compared every chemical variable with the quality rating. The goal was to identify patterns that might affect rating values. After choosing the variables, I checked how much effect they have in a linear model and made plots by quality rating group.

Before starting this analysis I was hoping to find clear patterns of some variables affecting quality ratings. After seeing some histograms it was clear that some variables, such as density and pH, fluctuate very little. They are also a byproduct of other variables: density changes when the fluid chemistry differs, and pH changes with different amounts of acidic or basic components, and many of the other variables are either acidic or basic. When comparing the quality rating with every variable, there was no clear tendency. The most promising variables were alcohol, volatile acidity, citric acid, and sulphates. However, when analyzing further with quality rating categories and comparing the quality rating with the quality index (calculated from the chosen variables), the evidence that the chosen variables affect quality ratings was minimal, if any. Lastly, the linear model supported my findings: the correlation was low. The findings were disappointing since I was expecting different results.

Like any analysis, this one has flaws that might lead to wrong results. Possible reasons include:

  1. Because the wines were evaluated by humans, ratings could be influenced by subjective feelings and tastes, which this data does not take into account. The psychological state of the panel members during tasting can also make ratings subjective.
  2. There might be other attributes that are not reflected in the data set, for example the color of the wine or the taste preferences of the members of the wine tasting commission.
  3. Small sample size. The lack of data was especially clear at the low and high quality ratings.
  4. All wines were made by 1 company. A wider range of wine suppliers would make the data applicable to the whole red wine market.

It would be nice to expand this analysis with a bigger, more diverse data set.

Overall, this data analysis expanded my view of wines. It also sharpened my data analysis and R programming skills. Although it wasn’t easy, it was worth it.